1 Introduction

This evaluation covers 3 data tables in total. We evaluated them using the metrics implemented in the OmicsEV package. The sample and class information for each data table is shown in the table below.

class CDAP MQ_ratio paper
Basal 11 11 11
Her2 8 8 8
LumA 12 12 12
LumB 18 18 18
None 14 14 14

The detailed sample information is shown below.

sample class batch order
TCGA.AO.A12D None 1 1
TCGA.C8.A131 Basal 1 2
TCGA.AO.A12B None 1 3
TCGA.E2.A10A LumA 1 4
TCGA.C8.A130 LumB 1 5
TCGA.C8.A138 Her2 1 6
TCGA.E2.A154 LumA 1 7
TCGA.A8.A09I LumB 1 8
TCGA.C8.A12L Her2 1 9
TCGA.A2.A0EX LumA 1 10
TCGA.AN.A04A None 1 11
TCGA.BH.A0AV Basal 1 12
TCGA.A2.A0D0 Basal 1 13
TCGA.C8.A12T Her2 1 14
TCGA.A8.A06Z LumB 1 15
TCGA.A2.A0D1 None 1 16
TCGA.A2.A0CM Basal 1 17
TCGA.A2.A0YI LumA 1 18
TCGA.A2.A0EQ Her2 1 19
TCGA.AR.A0TY LumB 1 20
TCGA.AR.A0U4 None 1 21
TCGA.BH.A0HP LumA 1 22
TCGA.BH.A0EE Her2 2 23
TCGA.AO.A0J9 None 2 24
TCGA.AN.A0FK LumA 2 25
TCGA.AO.A0J6 None 2 26
TCGA.A7.A13F LumB 2 27
TCGA.A7.A0CE Basal 2 28
TCGA.A2.A0YC LumA 2 29
TCGA.AO.A0JC None 2 30
TCGA.AR.A0TX Her2 2 31
TCGA.D8.A13Y LumB 2 32
TCGA.A8.A076 LumB 2 33
TCGA.AO.A126 None 2 34
TCGA.C8.A12P Her2 2 35
TCGA.BH.A0C1 LumA 2 36
TCGA.A2.A0EY LumB 2 37
TCGA.AR.A1AW LumB 2 38
TCGA.AR.A1AV LumA 2 39
TCGA.C8.A135 Her2 2 40
TCGA.A2.A0EV LumA 2 41
TCGA.AN.A0AM LumB 2 42
TCGA.D8.A142 Basal 2 43
TCGA.AN.A0FL Basal 3 44
TCGA.AN.A0AS LumA 3 45
TCGA.AR.A0TV LumB 3 46
TCGA.C8.A12Z Her2 3 47
TCGA.AO.A0JJ None 3 48
TCGA.AO.A0JE None 3 49
TCGA.A2.A0T2 Basal 3 50
TCGA.AN.A0AJ LumB 3 51
TCGA.A7.A0CJ LumB 3 52
TCGA.AO.A12F None 3 53
TCGA.A2.A0YL LumA 3 54
TCGA.A2.A0T7 LumA 3 55
TCGA.C8.A12Q Her2 3 56
TCGA.A8.A079 LumB 3 57
TCGA.E2.A159 Basal 3 58
TCGA.A2.A0T3 LumB 3 59
TCGA.A2.A0YD LumA 3 60
TCGA.AR.A0TR LumA 3 61
TCGA.AO.A03O None 3 62
TCGA.AO.A12E None 3 63
TCGA.A8.A06N LumB 3 64
TCGA.A2.A0T1 Her2 3 65
TCGA.A2.A0YG LumB 3 66
TCGA.E2.A150 Basal 3 67
TCGA.A7.A0CD LumA 4 68
TCGA.C8.A12W LumB 4 69
TCGA.AN.A0AL Basal 4 70
TCGA.A2.A0T6 LumA 4 71
TCGA.AO.A0JM None 4 72
TCGA.C8.A12V Basal 4 73
TCGA.A2.A0D2 Basal 4 74
TCGA.C8.A12U LumB 4 75
TCGA.A8.A09G Her2 4 76
TCGA.C8.A134 Basal 4 77
TCGA.A2.A0YF LumA 4 78
TCGA.BH.A0E9 LumA 4 79
TCGA.AR.A0TT LumB 4 80
TCGA.AR.A1AQ Basal 4 81
TCGA.A2.A0SW LumB 4 82
TCGA.AO.A0JL None 4 83
TCGA.A2.A0YM Basal 4 84
TCGA.BH.A0C7 LumB 4 85
TCGA.A2.A0SX Basal 4 86

2 Overview

The table below provides an overview of all the quantitative metrics generated in the evaluation. For each metric, the value of the best data table is highlighted. Details of each metric can be found in the corresponding section below.

metric CDAP MQ_ratio paper
#identified features 10625 (0.5212) 11368 (0.5576) 10062 (0.4936)
#quantifiable features 9465 (0.4643) 9492 (0.4656) 9227 (0.4526)
non_missing_value_ratio 0.9505 0.9493 0.9397
data_dist_similarity 0.9851 0.9819 0.9188
silhouette_width -0.1857 (0.8143) -0.3020 (0.6980) -0.4237 (0.5763)
pcRegscale 0.0465 (0.9535) 0.0000 (1.0000) 0.0000 (1.0000)
complex_auc 0.7486 0.7567 0.7438
func_auc 0.7655 0.7532 0.8227
class_auc 0.7730 0.7804 0.7434
gene_wise_cor 0.4203 0.4235 0.3784
sample_wise_cor 0.1916 0.2125 0.1783

The radar plot shown below is generated from the data in the overview table above. For the radar plot, each metric is converted, where necessary, to a scale from 0 to 1 on which a higher value indicates better data quality. The converted values are given in parentheses.
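
The conversions can be illustrated with a short Python sketch. The rules below are inferred from the values reported in the overview table (counts divided by the 20386 genes of the study species, 1 - |s| for the silhouette width, 1 - x for pcRegscale); they are not taken from the OmicsEV source code, and the function names are hypothetical.

```python
# Assumed conversion rules, inferred from the reported values only.
TOTAL_GENES = 20386  # total proteins/genes in the study species (section 3.1)

def scale_count(n, total=TOTAL_GENES):
    """Fraction of the species' genes covered; higher is better."""
    return n / total

def scale_silhouette(s):
    """Batch silhouette width: values near 0 are best, so use 1 - |s|."""
    return 1.0 - abs(s)

def scale_pc_regscale(x):
    """pcRegscale: 0 means no batch-associated variance, so use 1 - x."""
    return 1.0 - x

print(round(scale_count(10625), 4))         # CDAP #identified features -> 0.5212
print(round(scale_silhouette(-0.1857), 4))  # CDAP silhouette_width     -> 0.8143
print(round(scale_pc_regscale(0.0465), 4))  # CDAP pcRegscale           -> 0.9535
```

The three printed values match the parenthesized CDAP entries in the overview table.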

3 Data depth

3.1 Study-wise

The table below shows the number of identified proteins or genes for each data table. Proteins or genes that pass the 50% missing-value filter are counted as quantified. The values in parentheses are the percentages of proteins or genes identified or quantified, relative to the total number of proteins or genes (20386) in the study species.

data table #identified features #quantifiable features
CDAP 10625 (52.12%) 9465 (46.43%)
MQ_ratio 11368 (55.76%) 9492 (46.56%)
paper 10062 (49.36%) 9227 (45.26%)
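
As a rough illustration of how such counts could be derived, here is a minimal Python sketch. It is not OmicsEV's implementation. It assumes that a feature is "identified" when it has at least one non-missing value and "quantifiable" when at most 50% of its values are missing; the exact definitions, the boundary handling, and the name `count_features` are assumptions.

```python
def count_features(table, max_missing=0.5):
    """Count identified and quantifiable features.

    table maps a feature name to its per-sample values, with None for "NA".
    Assumed definitions: identified = at least one observed value;
    quantifiable = identified and missing fraction <= max_missing.
    """
    identified = 0
    quantifiable = 0
    for values in table.values():
        n_missing = sum(v is None for v in values)
        if n_missing < len(values):
            identified += 1
            if n_missing / len(values) <= max_missing:
                quantifiable += 1
    return identified, quantifiable

toy = {
    "GENE_A": [1.2, 0.8, None, 1.1],     # 25% missing
    "GENE_B": [None, None, None, 0.3],   # 75% missing
    "GENE_C": [None, None, None, None],  # never observed
}
print(count_features(toy))  # (2, 1)
```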

The UpSet chart below shows the overlap in proteins or genes identified across data tables. The numbers of identified proteins or genes shared between data tables are shown in the top bar chart, and the data tables in each set are indicated by solid points below the bar chart. Total identifications for each data table are shown on the left as ‘Set size’.

3.2 Sample-wise

The figures below show the number of proteins or genes identified in each sample. A gene or protein is considered identified in a sample only if its quantification value in that sample is not “NA”. Samples from different batches are drawn with different shapes, and samples from different classes with different colors.

[Figure panels: CDAP, MQ_ratio, paper]

3.3 Missing value distribution

The missing value distribution gives an overview of the percentage of missing values across all proteins or genes in both the QC and experiment samples.

data table non_missing_value_ratio
CDAP 0.9505
MQ_ratio 0.9493
paper 0.9397

[Figure panels: CDAP, MQ_ratio, paper]

4 Data normalization

4.1 Boxplot

The boxplots show the protein or gene expression distribution across samples. The x-axis shows the samples in input order. The y-axis shows log2-transformed protein or gene expression. Samples from different classes are shown in different colors.

[Figure panels: CDAP, MQ_ratio, paper]

To quantify the normalization effect, an AUROC test is performed for each pair of samples to measure how well feature abundance distinguishes the two samples. A score is then computed as 1-2*abs(AUROC-0.5), which ranges from 0 to 1; higher is better, and a score of 1 means no systematic difference between the two samples. The final metric for each data table is the median of the scores over all sample pairs.

data table data_dist_similarity n
CDAP 0.9851 1953
MQ_ratio 0.9819 1953
paper 0.9188 1953
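
The pairwise score can be sketched in Python as follows. This is an illustrative sketch rather than the OmicsEV code: it assumes each sample is a plain list of abundance values with missing values already removed, and the function names are hypothetical. Note that n = 1953 in the table equals the number of unordered pairs among 63 samples (63*62/2).

```python
from itertools import combinations
from statistics import median

def auroc(pos, neg):
    """Probability that a random value from pos exceeds one from neg (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def pair_score(a, b):
    """1 - 2*|AUROC - 0.5|: 1 = indistinguishable samples, 0 = fully separated."""
    return 1.0 - 2.0 * abs(auroc(a, b) - 0.5)

def data_dist_similarity(samples):
    """Median pair score over all sample pairs, plus the number of pairs."""
    scores = [pair_score(a, b) for a, b in combinations(samples, 2)]
    return median(scores), len(scores)

# Toy data: two samples with identical distributions and one shifted sample.
toy = [[1, 2, 3], [1, 2, 3], [11, 12, 13]]
print(pair_score(toy[0], toy[1]))  # 1.0 (no systematic difference)
print(pair_score(toy[0], toy[2]))  # 0.0 (completely shifted)
print(data_dist_similarity(toy))
```

The quadratic AUROC loop is kept for readability; a rank-based (Mann-Whitney) formulation would be the usual choice for thousands of features.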

4.2 Density plot

The density plots show the protein or gene expression distribution across samples. The x-axis shows log2-transformed protein or gene expression. The y-axis shows density.

5 Batch effect

5.1 Silhouette width

The silhouette width s(i) ranges from -1 to 1 and is computed here with the batches taken as clusters. s(i) -> 1 if two clusters are separate, and s(i) -> -1 if two clusters overlap but have dissimilar variance. If s(i) -> 0, both clusters have roughly the same structure. Thus, we use the absolute value |s| as an indicator for the presence or absence of batch effects: the closer |s| is to 0, the weaker the batch effect.

data table silhouette_width
CDAP -0.1857
MQ_ratio -0.3020
paper -0.4237

5.2 PCA with batch annotation

For each PC, we calculate Pearson’s correlation coefficient with the batch covariate b:

$r_i = \mathrm{corr}(PC_i, b)$

In a linear model with a single predictor, as is the case here for the PCs regressed on the batch covariate, the coefficient of determination $R^2$ is the squared Pearson’s correlation coefficient:

$R^2(PC_i, b) = r_i^2$

We then estimate the significance of the correlation coefficient with either a t-test or a one-way ANOVA. $R^2$ values highlighted in red are significant (p-value <= 0.05).

PC CDAP MQ_ratio paper
1 0.024 0.025 0.007
2 0.021 0.036 0.044
3 0.013 0 0.004
4 0 0.001 0.018
5 0.003 0.001 0.001
6 0.049 0.049 0.053
7 0.023 0.029 0.034
8 0.008 0.025 0
9 0.135 0.064 0.001
10 0.007 0.005 0.009

The percentage of variance explained by each PC:

PC CDAP MQ_ratio paper
1 12.4 12.5 11.6
2 7.7 8.0 8.2
3 7.1 7.1 7.2
4 4.5 4.5 4.0
5 4.1 4.0 4.0
6 3.7 3.7 3.4
7 3.0 3.1 2.6
8 2.5 2.5 2.4
9 2.3 2.2 2.3
10 2.2 2.2 2.3

‘Scaled PC regression’, i.e. the total variance of the PCs that correlate significantly with the batch covariate (FDR<0.05), scaled by the total variance of the 10 PCs:

data table pcRegscale
CDAP 0.0465
MQ_ratio 0.0000
paper 0.0000
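
The two quantities in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the OmicsEV code: batch is treated as a numeric covariate for the Pearson correlation, the significance test is left out, and the function names are hypothetical. The worked example assumes that PC9 is the only PC significantly associated with batch for CDAP; this is an inference from the R2 and variance tables above, because the red highlighting is not preserved in this text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_squared(pc_scores, batch):
    """R^2 of a PC against the batch covariate, i.e. the squared correlation."""
    return pearson(pc_scores, batch) ** 2

def pc_regscale(var_explained, significant):
    """Variance of batch-associated PCs divided by the variance of all listed PCs."""
    total = sum(var_explained)
    return sum(v for v, s in zip(var_explained, significant) if s) / total

# CDAP variance explained by PC1..PC10, copied from the table above.
cdap_var = [12.4, 7.7, 7.1, 4.5, 4.1, 3.7, 3.0, 2.5, 2.3, 2.2]
only_pc9 = [i == 8 for i in range(10)]  # assumption: only PC9 is significant
print(round(pc_regscale(cdap_var, only_pc9), 4))  # 0.0465
```

Under that assumption the result, 2.3 / 49.5, reproduces the reported CDAP value of 0.0465; with no significant PC the metric is 0, as reported for MQ_ratio and paper.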

5.3 Correlation heatmap

In these figures, both rows and columns are samples. The color indicates the correlation between each pair of samples. Samples are ordered by batch.

[Figure panels: CDAP, MQ_ratio, paper]

6 Biological signal

6.1 Correlation among protein complex members

The table below summarizes the evaluation. ‘diff’ is Cor(intra) - Cor(inter). ‘complex_auc’ is the AUROC value based on the correlations of protein pairs from the two groups (intra-complex and inter-complex).

data table InterComplex IntraComplex diff complex_auc
CDAP 0.0061 0.2261 0.2200 0.7486
MQ_ratio 0.0051 0.2331 0.2281 0.7567
paper 0.0153 0.2254 0.2101 0.7438
RNA 0.0247 0.1500 0.1252 0.6549
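
A minimal Python sketch of the two summary statistics, on made-up correlation values. It assumes that ‘diff’ is a difference of group means (the report may use another summary such as the median) and that the AUROC ranks intra-complex pairs against inter-complex pairs by their correlation; the numbers and names are illustrative only.

```python
from statistics import mean

def auroc(pos, neg):
    """Probability that a random value from pos exceeds one from neg (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical pairwise correlations, not values from this report.
intra = [0.6, 0.4, 0.3]         # protein pairs within the same complex
inter = [0.1, 0.0, 0.35, -0.2]  # protein pairs from different complexes

diff = mean(intra) - mean(inter)    # Cor(intra) - Cor(inter), assuming means
complex_auc = auroc(intra, inter)   # 11 of 12 pairings are ordered correctly
print(round(diff, 4), round(complex_auc, 4))
```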

6.2 Gene function prediction

In this evaluation, each data table was used to build a co-expression network. For a selected network and a selected function term (such as a GO or KEGG term), proteins/genes annotated to the term and also included in the network were defined as the positive protein/gene set, and the other proteins/genes in the network constituted the negative protein/gene set for the term. For a selected function term, we use some of the proteins/genes as seeds and then use a random walk algorithm to calculate scores for the other proteins/genes. A higher score indicates a closer relationship between a protein/gene and the seed proteins/genes. Finally, for each selected function term, we calculate an AUROC to evaluate the prediction performance.
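
The scoring step can be illustrated with a random walk with restart on a toy network. This Python sketch is one common formulation of a network random walk and is not taken from OmicsEV: the restart probability of 0.5, the adjacency format, and the function name are assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adj: dict node -> dict neighbor -> edge weight (every node must be a key).
    """
    nodes = list(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            wsum = sum(adj[u].values())
            if wsum == 0:
                continue  # isolated node: nothing to propagate
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / wsum
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy co-expression network: a chain A - B - C - D with unit weights.
net = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
scores = random_walk_with_restart(net, seeds={"A"})
print(sorted(scores, key=scores.get, reverse=True))  # nodes ranked by closeness to the seed
```

Scores fall with network distance from the seed; ranking the non-seed positives against the negatives by these scores would then give the per-term AUROC.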

term CDAP MQ_ratio paper RNA
Acute myeloid leukemia 0.622 0.603 0.808 0.618
Adherens junction 0.593 0.649 0.682 0.539
Adipocytokine signaling pathway 0.602 0.626 0.758 0.594
Alanine, aspartate and glutamate metabolism 0.806 0.789 0.724 0.575
Aldosterone-regulated sodium reabsorption 0.792 0.685 0.881 0.634
Alzheimers disease 0.78 0.807 0.789 0.681
Amino sugar and nucleotide sugar metabolism 0.705 0.761 0.775 0.651
Aminoacyl-tRNA biosynthesis 0.811 0.718 0.799 0.735
Amoebiasis 0.666 0.691 0.784 0.606
Amyotrophic lateral sclerosis (ALS) 0.68 0.57 0.664 0.606
Antigen processing and presentation 0.698 0.785 0.843 0.833
Apoptosis 0.6 0.635 0.69 0.594
Arachidonic acid metabolism 0.69 0.714 0.626 0.746
Arginine and proline metabolism 0.632 0.773 0.683 0.668
Arrhythmogenic right ventricular cardiomyopathy (ARVC) 0.698 0.845 0.831 0.586
Axon guidance 0.552 0.653 0.592 0.607
B cell receptor signaling pathway 0.721 0.686 0.76 0.546
Bacterial invasion of epithelial cells 0.685 0.633 0.744 0.544
Base excision repair 0.776 0.627 0.633 0.727
beta-Alanine metabolism 0.681 0.753 0.748 0.591
Bladder cancer 0.67 0.642 0.604 0.553
Calcium signaling pathway 0.647 0.695 0.761 0.58
Carbohydrate digestion and absorption 0.723 0.67 0.905 0.775
Cardiac muscle contraction 0.835 0.855 0.919 0.705
Cell adhesion molecules (CAMs) 0.801 0.735 0.819 0.788
Cell cycle 0.716 0.704 0.82 0.751
Chagas disease (American trypanosomiasis) 0.64 0.622 0.779 0.56
Chemokine signaling pathway 0.611 0.673 0.669 0.592
Chronic myeloid leukemia 0.565 0.58 0.729 0.585
Citrate cycle (TCA cycle) 0.964 0.953 0.914 0.753
Colorectal cancer 0.582 0.558 0.611 0.576
Complement and coagulation cascades 0.909 0.918 0.901 0.874
Cysteine and methionine metabolism 0.674 0.659 0.663 0.632
Cytokine-cytokine receptor interaction 0.645 0.709 0.61 0.656
Cytosolic DNA-sensing pathway 0.708 0.668 0.676 0.651
Dilated cardiomyopathy 0.723 0.82 0.745 0.653
DNA replication 0.851 0.824 0.862 0.753
Drug metabolism - other enzymes 0.743 0.766 0.721 0.672
ECM-receptor interaction 0.895 0.906 0.855 0.765
Endocytosis 0.57 0.634 0.604 0.566
Endometrial cancer 0.628 0.662 0.686 0.556
Epithelial cell signaling in Helicobacter pylori infection 0.634 0.596 0.582 0.601
ErbB signaling pathway 0.64 0.604 0.624 0.573
Fatty acid metabolism 0.824 0.827 0.771 0.645
Fc epsilon RI signaling pathway 0.694 0.599 0.821 0.621
Fc gamma R-mediated phagocytosis 0.657 0.649 0.762 0.581
Focal adhesion 0.699 0.745 0.759 0.654
Fructose and mannose metabolism 0.739 0.753 0.807 0.629
Galactose metabolism 0.722 0.677 0.802 0.659
Gap junction 0.576 0.556 0.81 0.616
Gastric acid secretion 0.629 0.755 0.735 0.578
Glioma 0.514 0.662 0.552 0.533
Glutathione metabolism 0.662 0.549 0.65 0.615
Glycerolipid metabolism 0.675 0.656 0.592 0.612
Glycerophospholipid metabolism 0.677 0.67 0.628 0.591
Glycine, serine and threonine metabolism 0.666 0.715 0.697 0.739
Glycolysis / Gluconeogenesis 0.749 0.804 0.788 0.587
Glyoxylate and dicarboxylate metabolism 0.811 0.789 0.834 0.699
GnRH signaling pathway 0.632 0.587 0.725 0.654
Hematopoietic cell lineage 0.746 0.7 0.823 0.687
Hepatitis C 0.726 0.643 0.729 0.619
Huntingtons disease 0.803 0.79 0.895 0.745
Hypertrophic cardiomyopathy (HCM) 0.727 0.821 0.8 0.604
Inositol phosphate metabolism 0.584 0.633 0.565 0.659
Insulin signaling pathway 0.616 0.555 0.697 0.542
Jak-STAT signaling pathway 0.63 0.663 0.634 0.581
Leishmaniasis 0.723 0.664 0.779 0.684
Leukocyte transendothelial migration 0.779 0.729 0.796 0.652
Long-term depression 0.674 0.59 0.846 0.603
Long-term potentiation 0.567 0.703 0.807 0.611
Lysine degradation 0.827 0.776 0.66 0.606
Lysosome 0.738 0.772 0.768 0.589
Malaria 0.786 0.834 0.748 0.808
MAPK signaling pathway 0.587 0.614 0.651 0.515
Melanogenesis 0.62 0.573 0.886 0.637
Melanoma 0.515 0.572 0.668 0.625
Metabolic pathways 0.683 0.674 0.71 0.605
mRNA surveillance pathway 0.769 0.728 0.746 0.564
mTOR signaling pathway 0.763 0.637 0.654 0.588
N-Glycan biosynthesis 0.765 0.74 0.831 0.716
Natural killer cell mediated cytotoxicity 0.735 0.639 0.691 0.554
Neurotrophin signaling pathway 0.643 0.511 0.57 0.554
NOD-like receptor signaling pathway 0.662 0.651 0.738 0.578
Non-small cell lung cancer 0.613 0.53 0.777 0.571
Notch signaling pathway 0.502 0.639 0.668 0.569
Nucleotide excision repair 0.727 0.752 0.741 0.534
Oocyte meiosis 0.671 0.69 0.758 0.584
Osteoclast differentiation 0.725 0.638 0.726 0.601
Oxidative phosphorylation 0.939 0.967 0.937 0.777
p53 signaling pathway 0.63 0.703 0.582 0.738
Pancreatic cancer 0.589 0.628 0.572 0.582
Pancreatic secretion 0.766 0.753 0.842 0.555
Parkinsons disease 0.804 0.877 0.876 0.798
Pathogenic Escherichia coli infection 0.697 0.636 0.682 0.606
Pathways in cancer 0.558 0.55 0.667 0.534
Pentose phosphate pathway 0.811 0.871 0.725 0.72
Peroxisome 0.651 0.644 0.711 0.569
Phagosome 0.722 0.727 0.797 0.663
Phosphatidylinositol signaling system 0.565 0.534 0.621 0.628
Porphyrin and chlorophyll metabolism 0.571 0.74 0.593 0.562
PPAR signaling pathway 0.712 0.637 0.609 0.57
Prion diseases 0.655 0.712 0.9 0.663
Progesterone-mediated oocyte maturation 0.62 0.589 0.798 0.565
Propanoate metabolism 0.713 0.817 0.859 0.584
Prostate cancer 0.636 0.585 0.578 0.534
Protein digestion and absorption 0.877 0.947 0.859 0.852
Protein processing in endoplasmic reticulum 0.662 0.732 0.682 0.72
Purine metabolism 0.638 0.673 0.595 0.606
Pyrimidine metabolism 0.692 0.641 0.72 0.587
Pyruvate metabolism 0.828 0.766 0.842 0.507
Regulation of actin cytoskeleton 0.612 0.618 0.748 0.571
Renal cell carcinoma 0.606 0.551 0.689 0.555
Rheumatoid arthritis 0.62 0.679 0.765 0.607
Ribosome 0.969 0.976 0.954 0.83
RIG-I-like receptor signaling pathway 0.587 0.617 0.729 0.563
Salivary secretion 0.672 0.663 0.809 0.701
Shigellosis 0.574 0.673 0.738 0.555
Small cell lung cancer 0.601 0.708 0.7 0.596
SNARE interactions in vesicular transport 0.824 0.89 0.811 0.66
Sphingolipid metabolism 0.821 0.844 0.741 0.626
Staphylococcus aureus infection 0.841 0.794 0.938 0.907
Systemic lupus erythematosus 0.79 0.625 0.913 0.795
T cell receptor signaling pathway 0.625 0.613 0.734 0.604
Tight junction 0.598 0.632 0.784 0.514
Toll-like receptor signaling pathway 0.605 0.602 0.594 0.585
Toxoplasmosis 0.545 0.638 0.745 0.567
Tryptophan metabolism 0.697 0.713 0.692 0.63
Type II diabetes mellitus 0.626 0.683 0.755 0.593
Ubiquitin mediated proteolysis 0.605 0.577 0.649 0.601
Valine, leucine and isoleucine degradation 0.822 0.838 0.827 0.693
Vascular smooth muscle contraction 0.594 0.722 0.913 0.55
VEGF signaling pathway 0.606 0.56 0.767 0.571
Vibrio cholerae infection 0.67 0.765 0.84 0.657
Viral myocarditis 0.628 0.638 0.767 0.726
Wnt signaling pathway 0.594 0.659 0.595 0.603

[Figure panels: CDAP, MQ_ratio, paper]

6.3 Sample class prediction

For each data table, machine learning models are built to predict the sample class (LumA vs. LumB). In OmicsEV, random forest models are built and evaluated using 5-fold cross-validation repeated 20 times.

dataSet mean_ROC median_ROC sd_ROC
CDAP 0.7730 0.7731 0.0196
MQ_ratio 0.7804 0.7755 0.0150
paper 0.7434 0.7431 0.0135
RNA 0.9899 0.9898 0.0038
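
The resampling scheme alone (without the random forest) can be sketched in Python as below. Whether OmicsEV stratifies the folds by class is not stated in this report, so the stratification, the seed, and the function name are assumptions; the class sizes are taken from the table in section 1.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, k=5, repeats=20, seed=0):
    """Yield (train_indices, test_indices) for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        # Deal each class's shuffled samples round-robin so folds keep the class mix.
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            for pos, i in enumerate(shuffled):
                folds[pos % k].append(i)
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield train, test

labels = ["LumA"] * 12 + ["LumB"] * 18  # class sizes from the table in section 1
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 100 train/test splits (5 folds x 20 repeats)
```

Each split would be used to fit a model on the training indices and compute an AUROC on the test indices; the summary columns in the table are statistics over those AUROC values.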

6.4 PCA with sample class annotation

[Figure panels: CDAP, MQ_ratio, paper]

6.5 Unsupervised clustering

[Figure panels: CDAP, MQ_ratio, paper]

7 Multi-omics concordance

7.1 Gene-wise mRNA-protein correlation

data table n n5 n6 n7 n8 gene_wise_cor
CDAP 8671 3195 1682 593 92 0.4203
MQ_ratio 8524 3135 1648 581 79 0.4235
paper 8893 2824 1508 541 70 0.3784

[Figure panels: CDAP, MQ_ratio, paper]

7.2 Sample-wise mRNA-protein correlation

data table sample_wise_cor
CDAP 0.1916
MQ_ratio 0.2125
paper 0.1783